ABSTRACT


Although backward error recovery with recovery blocks (RB's) has received
considerable attention from many researchers, no attempt has been made to
structure its implementation alternatives and then to evaluate/analyze their
effectiveness. In this paper we consider three different methods of
implementing RB's. These are the asynchronous, synchronous, and the pseudo
recovery point implementations.


Asynchronous RB's are based on the concept of maximum autonomy in each of the
concurrent processes. Consequently, establishment of RB's in a process is made
independently of others and unbounded rollback becomes a serious problem.


In order to completely avoid unbounded rollback, it is necessary to synchronize
the establishment of recovery blocks in all cooperating processes. Process
autonomy is sacrificed and processes are forced to wait for the commitment to
establishing a recovery line, leading to inefficiency in time utilization.

As a compromise between asynchronous and synchronous RB's, we propose to
insert pseudo recovery points (PRP's) so that unbounded rollback may be avoided
while maintaining process autonomy.

We have developed probabilistic models for analyzing these three methods
under standard assumptions in computer performance analysis, i.e. exponential
distributions for related random variables. With these models we have estimated
(i) the interval between two successive recovery lines for asynchronous RB's,
(ii) mean loss in computation power for the synchronized method, and (iii)
additional overhead and rollback distance in case PRP's are used.

1. INTRODUCTION

Recent advances in VLSI and communication network technologies have made
distributed processing feasible. While distributed processing can theoretically
be exploited to provide computation speedup, cost-effectiveness and tolerance of
component failure, several problems remain to be solved before its full
potential can be realized in practice. In this paper, we consider one such
problem: that of implementing backward error recovery for concurrent processes
with recovery blocks.

The best known technique of backward error recovery, the recovery block
(RB), was proposed by Horning [1] and Randell [2]. It is a sequential program
structure that consists of an acceptance test, a recovery point (RP), and
alternative algorithms for a given process. A process saves its state at its
recovery point and then enters a recovery region. At the end of a recovery
block, the acceptance test is executed to check correctness of the computation
results. In case an error is detected during the normal execution or the
computation results fail to pass the acceptance test, the process rolls back to
an old state saved at the previous RP and executes one of the other
alternatives.
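
The control flow of a single recovery block can be illustrated with a minimal
Python sketch. It is not taken from the paper: the function name
run_recovery_block, the dictionary-based process state, and the two sorting
alternates are assumptions chosen only to make the example self-contained.

```python
from copy import deepcopy


def run_recovery_block(state, alternates, acceptance_test):
    """Try each alternate from the saved state until one passes the test."""
    recovery_point = deepcopy(state)         # save the state at the recovery point (RP)
    for alternate in alternates:
        work = deepcopy(recovery_point)      # roll back: restart from the saved state
        try:
            alternate(work)                  # an exception models an error detected during execution
        except Exception:
            continue                         # try the next alternate
        if acceptance_test(work):
            return work                      # results accepted, leave the recovery block
    raise RuntimeError("every alternate failed the acceptance test")


def faulty_sort(s):                          # hypothetical primary: loses the largest element
    s["out"] = sorted(s["data"])[:-1]


def backup_sort(s):                          # hypothetical alternate
    s["out"] = sorted(s["data"])


def sorted_and_complete(s):                  # hypothetical acceptance test
    out = s["out"]
    return len(out) == len(s["data"]) and all(a <= b for a, b in zip(out, out[1:]))


result = run_recovery_block({"data": [3, 1, 2]}, [faulty_sort, backup_sort], sorted_and_complete)
print(result["out"])  # [1, 2, 3]
```

The primary alternate fails the acceptance test, so the saved state is restored
and the second alternate is executed. Note that this sketch covers one isolated
process only; it does not model the interactions that cause rollback
propagation.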

Unfortunately, however, for cooperating concurrent processes the rollback of a
process may cause other processes to roll back (this phenomenon is called
rollback propagation) because of process interactions and imperfect checking of
global correctness. Moreover, rollback may propagate to further RP's since
recovery points of individual processes may not provide a globally consistent
state for all processes involved. This rollback propagation continues until it
reaches a recovery line at which a globally consistent state does exist. In the
worst case, an avalanche of rollback propagation (called the domino effect) can
push the processes back to their beginnings, thus resulting in loss of the
entire computation done prior to the error occurrence.
A detailed description of the domino effect can be found in [3]. For
convenience let us consider Figure 1 to visualize a rollback propagation.
Process $P_1$ begins to roll back because of an unsuccessful acceptance test.
This rollback propagates to the other two processes, $P_2$ and $P_3$.
Eventually, the whole system has to restart from a recovery line, and the
computation done between that recovery line and the failed acceptance test has
to be discarded. The interval between the restart point and the time point at
which an error is detected or the acceptance test fails, called the rollback
distance, can be used to represent the computation loss in rollback recovery.

The domino effect is the major obstacle in implementing the recovery block
scheme for concurrent processes. The designer is able to predict neither the
time of the occurrence of process interactions nor that of the appearance of
recovery lines. Nonetheless, it is not desirable to randomly place recovery
points and acceptance tests without considering process characteristics.
Otherwise, it is possible to have a disaster such as unbounded rollback
propagations, a large rollback distance, and a great number of largely useless
recovery points occupying large amounts of memory space, etc. Furthermore,
decision on rollback propagation and determination of recovery lines will
become more complex though they can be made in a centralized [4,5] or
decentralized manner [6,7,8].

Several refinements have been proposed to overcome the drawbacks in this
recovery block scheme. One approach is to put concurrent processes into a
controlled scope, either to synchronize the occurrence of acceptance tests or
to direct process interactions. For the former, Randell [2] has suggested the
conversation scheme which requests every cooperating concurrent process to
leave its acceptance test at the same moment (called test line). He has also
proposed a language structure in an abstract form for the conversation scheme.
Other mechanizations of the conversation scheme on the basis of the same
concept but with more flexibility have been devised by Kim [9]. Synchronized
rollback recovery schemes for transactions using a two-phase commitment
protocol or transaction ordering are also studied in [10,11,12]. Russell has
proposed that information be retained for directed interactions from producers
to consumers so that rollback propagation can be blocked [13,14]. Another
approach is to save additional states based on the occurrence of interactions;
for example, the branch recovery point [15] and the system defined checkpoint
(SDCP) [16].

In this paper we propose to employ pseudo recovery points^1 (PRP's) to
alleviate the rollback propagation problem by allowing a process to restart at
a PRP in case the process is forced to roll back by others as a result of
rollback propagation. Therefore, we can classify these refinements into two
categories, synchronized recovery blocks and pseudo recovery points, providing
a contrast with the third category called asynchronous recovery blocks.

To implement the rollback recovery schemes, we have to consider various
trade-offs between these three categories and the characteristics of concurrent
processes. A satisfactory compromise should include an acceptable delay in
process completion due to rollbacks, the preservation of autonomy for each
process, and programmer transparency. Therefore, optimal solutions may be a
combination of these three categories. A quantitative analysis is necessary to
justify the solutions. For example, it is necessary to determine the mean
amount of computation undone in case processes roll back, the optimal interval
between two successive synchronizations, the mean size of memory space required
to save states, etc. However, because the program behavior is unknown and
execution proceeds stochastically, accurate modelling is difficult.

In this paper, employing standard assumptions in computer performance analysis,
we have developed a model to quantitatively describe the characteristics of
rollback recovery schemes as well as their effectiveness. In the following
section, several assumptions are discussed and then a model for asynchronous
recovery blocks is introduced. Using this model, we employ simulation to
present the probability distribution of the interval between two successive
recovery lines and the mean number of states recorded during that interval. In
Sections 3 and 4, the synchronization method and the implantation of pseudo
recovery points are evaluated respectively. The paper concludes with Section 5.

^1 We call it a pseudo recovery point (PRP) since there is no acceptance test
before the saving of process state at a PRP. The states recorded at PRP's may
have been contaminated and thus can not be used to recover a failed process.

2. EVALUATION OF ASYNCHRONOUS RECOVERY BLOCKS 

Let us consider the history diagram in Figure 1 to illustrate the activities of
cooperating concurrent processes $P_i$, $i = 1,2,\ldots,n$. Process $P_i$
establishes its $j$th recovery point $RP_j^i$ without synchronizing with other
processes. Interprocess communications are represented by arrowed horizontal
lines. Let set $A \subseteq \{1,\ldots,n\}$, i.e. a subset of concurrent
processes. Then one may find a combination of $RP_j^i$ for all $i \in A$, which
forms a recovery line for set $A$, denoted as $RL_r^A$ for the $r$th recovery
line. For simplicity superscripts in representing recovery lines will be
omitted in the sequel as long as that does not result in ambiguity. The
interval between two successive recovery lines $RL_r$ and $RL_{r+1}$ in process
$P_i$ is a random variable and denoted by $X_r^i$. Since a recovery line
provides globally consistent states to all members of process set $A$, it is
reasonable to assume that $X_r^i$ is stochastically identical for all
$i \in A$. Thus, $X_r$ is used to represent the interval between the $r$th and
$(r+1)$th recovery lines.

2.1. Modeling Assumptions 

We make the following assumptions in our subsequent analyses. 

1. Autonomous Processes: Cooperative autonomy is regarded as the most important
requirement in distributed processing. Each process should be executed
according to its own program and environment, almost as if there were no
processes to interfere with. Thus, a process is executing independently of
others as long as there is no conflict with others in accessing shared
resources. Since synchronization is not enforced in this category of recovery
blocks (i.e. asynchronous recovery blocks), processes transmit messages or
establish their recovery points independently of other processes.

2. Perfect Acceptance Test: Acceptance tests should detect all errors within the
local process during the execution of recovery blocks and thus ensure the
correctness of local execution. It is in general difficult to guarantee the
complete correctness, but at least the computation results that have passed the
acceptance test should be "acceptable" [3]. The local acceptance test may or
may not detect external errors or erroneous messages because the local process
is not aware of the global system and other processes.

3. Probability Distribution of Interactions: Usually, process behavior is
modeled as an ordered sequence which in turn is specified by the program and
dependent on the execution condition. Even if the processing sequence is given,
the interval between two successive interactions is variable due to conditional
branches. Locking and waiting at shared resources make it even more uncertain.
Nonetheless, for both tractability and simplicity we have adopted here constant
reference rates in the multiprocessor and exponentially distributed intervals
between two successive message transmissions in the computer network. The
interval for two successive interactions between $P_i$ and $P_j$ is thus
assumed to be exponentially distributed with mean $1/\lambda_{ij}$, and
$\lambda_{ij} = \lambda_{ji}$ for all $i,j = 1,2,\ldots,n$ and $i \neq j$.

4. Consistent Communications: Let two messages $m_1$ and $m_2$ be sent from
$P_i$ to $P_j$. Consistent communications should satisfy: (i) every message
sent from $P_i$ to $P_j$ will be received eventually by $P_j$, and (ii) $m_1$
and $m_2$ are received by $P_j$ in the same order as they are sent. Notice that
in some packet-switched computer networks, messages are allowed to be received
by the destination out of order. However, the order can be kept easily, for
example, by time-stamping messages at the time of transmission.

5. Distribution of Recovery Points: Because of process independence and the
uncertainty of execution conditions, the appearances of recovery points are
random and difficult to model. To avoid complexity, establishment of recovery
points in a process is assumed to be an independent Poisson process with
parameter $\mu_i$ for process $P_i$.

2.2. A Model for Asynchronous Recovery Blocks

Since individual recovery points by themselves may not be sufficient in rollback
recovery due to the possibility of unbounded rollback propagations, we consider
in this paper only the formation of recovery lines for asynchronous recovery
blocks instead of separate individual recovery points. The requirements of a
recovery line for processes $P_i$, for $i = 1,2,\ldots,n$, can be stated as
follows:

1. Each recovery line has to include one recovery point $RP_j^i$
for every process $P_i$.

2. Let the moment of establishment of the $j$th recovery point
in process $P_i$ be $t[RP_j^i]$ and let $t[I_q^{ii'}]$ be the moment of the
$q$th interaction from $P_i$ to $P_{i'}$. For every pair
$(RP_j^i, RP_{j'}^{i'})$ in a recovery line, there does not exist an integer
$k$ such that $t[I_k^{ii'}] \in [\,t[RP_j^i],\ t[RP_{j'}^{i'}]\,]$ if
$t[RP_j^i] \le t[RP_{j'}^{i'}]$ (otherwise,
$t[I_k^{ii'}] \in [\,t[RP_{j'}^{i'}],\ t[RP_j^i]\,]$). This implies
that no communication from $P_i$ to $P_{i'}$ (and vice versa) can be
sandwiched between $t[RP_j^i]$ and $t[RP_{j'}^{i'}]$.
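
The second requirement can be checked mechanically on a recorded history. The
following Python sketch is an illustration only: the function name
is_recovery_line, the representation of a candidate line as a mapping from
process number to recovery point time, and the sample history are all
assumptions, not part of the paper.

```python
from itertools import combinations

# An interaction is recorded as (time, process a, process b).


def is_recovery_line(rp_time, interactions):
    """rp_time maps each process to the time of its candidate recovery point.

    The candidate set is a recovery line if, for every pair of processes,
    no interaction between the two falls between their two recovery points.
    """
    for a, b in combinations(sorted(rp_time), 2):
        lo, hi = sorted((rp_time[a], rp_time[b]))
        for t, p, q in interactions:
            if {p, q} == {a, b} and lo <= t <= hi:
                return False             # an interaction is sandwiched
    return True


# Hypothetical history for three processes.
history = [(2.0, 1, 2), (5.0, 2, 3)]
print(is_recovery_line({1: 1.0, 2: 3.0, 3: 4.0}, history))  # False
print(is_recovery_line({1: 2.5, 2: 3.0, 3: 4.0}, history))  # True
```

In the first candidate the interaction at time 2.0 lies between the recovery
points of processes 1 and 2, so restarting from those points would leave a
message recorded by one process and not by the other.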

The basic idea underlying the model is to trace the occurrence of both recovery
points and interactions. Based on the assumptions in Section 2.1, random
variable $X_r$ can be modeled by a continuous-time Markov process starting from
a recovery line ($RL_r$) and ending at the next recovery line ($RL_{r+1}$). For
a set of processes $\{P_i \mid i \in A\}$ where $A = \{1,2,\ldots,n\}$, two
types of states are defined:

(a). End states $S_r$ and $S_{r+1}$: transitions start from $S_r$, where all
processes have formed the $r$th recovery line, and end at $S_{r+1}$
upon establishment of the $(r+1)$th recovery line.

(b). Intermediate states $S = (x_1, x_2, \ldots, x_n)$, where $x_i = 0$
if the previous action of $P_i$ was an interaction, and
$x_i = 1$ if it was establishment of a recovery point.

Occurrences of interactions and recovery points in a process make the system
go through these states. Note that both $S_r$ and $S_{r+1}$ are equivalent to
state $(1,1,\ldots,1)$. We can establish the following transition rules:

R1. The system goes to state $(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_n)$
from state $(x_1,\ldots,x_{i-1},0,x_{i+1},\ldots,x_n)$ with rate $\mu_i$ upon
establishment of a recovery point in $P_i$.

R2. The system leaves state
$(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_{j-1},1,x_{j+1},\ldots,x_n)$ and
enters state
$(x_1,\ldots,x_{i-1},0,x_{i+1},\ldots,x_{j-1},0,x_{j+1},\ldots,x_n)$ with rate
$\lambda_{ij}$ if there is an interaction between $P_i$ and $P_j$.

R3. The system arrives at state $(x_1,\ldots,x_{i-1},0,x_{i+1},\ldots,x_n)$
from state $(x_1,\ldots,x_{i-1},1,x_{i+1},\ldots,x_n)$ with transition
rate $\sum_{k \in B_i} \lambda_{ik}$, where
$B_i = \{\, j \mid x_j = 0,\ j \neq i \text{ and } j \in A \,\}$.

R4. The system can transfer directly from state $S_r$ to state $S_{r+1}$
with transition rate $\sum_{i=1}^{n} \mu_i$.
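
To make rules R1 to R4 concrete, the sketch below enumerates the transition
rates of the model for a small number of processes. It is only an illustration
under assumptions of our own: the function name build_rates, the labels "start"
and "end" for $S_r$ and $S_{r+1}$, zero-based process indices, and the symmetric
matrix lam holding the $\lambda_{ij}$ are not part of the paper.

```python
from itertools import product


def build_rates(mu, lam):
    """Return {(source, target): rate} for the asynchronous RB model.

    mu[i] is the recovery point rate of process i; lam[i][j] is the
    interaction rate of the pair (i, j). Intermediate states are tuples
    of 0/1 flags; S_r and S_{r+1} are labelled "start" and "end".
    """
    n = len(mu)
    ones = (1,) * n
    rates = {}

    def add(src, dst, rate):
        if rate > 0:
            rates[(src, dst)] = rates.get((src, dst), 0.0) + rate

    add("start", "end", sum(mu))                       # R4
    for x in product((0, 1), repeat=n):
        src = "start" if x == ones else x              # all ones as a source is S_r
        for i in range(n):
            if x[i] == 0:                              # R1: P_i establishes an RP
                y = x[:i] + (1,) + x[i + 1:]
                add(src, "end" if y == ones else y, mu[i])
            else:                                      # R3: P_i interacts with a 0-process
                rate = sum(lam[i][k] for k in range(n) if k != i and x[k] == 0)
                add(src, x[:i] + (0,) + x[i + 1:], rate)
            for j in range(i + 1, n):                  # R2: two 1-processes interact
                if x[i] == 1 and x[j] == 1:
                    y = list(x)
                    y[i] = y[j] = 0
                    add(src, tuple(y), lam[i][j])
    return rates


rates = build_rates([1.0, 1.0, 1.0], [[0, 1, 1], [1, 0, 1], [1, 1, 0]])
states = {s for pair in rates for s in pair}
print(len(states))  # 9
```

For three processes the enumeration yields the two end states plus the seven
intermediate states other than $(1,1,1)$.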

Under these transition rules a Markov model is developed for three processes
$P_1$, $P_2$ and $P_3$, and presented in Fig. 2. The single-arrow lines are
unidirectional transitions. The double-arrow lines are bidirectional
transitions in which left-hand side parameters represent leftward transition
rates and right-hand side parameters rightward transition rates. The number of
states for a set of $n$ processes is $2^n + 1$.

When $\mu_i = \mu_j = \mu$ and $\lambda_{ij} = \lambda$ for all $i, j \in A$,
the model can be simplified since all intermediate states
$S = (x_1, x_2, \ldots, x_n)$ containing exactly $u$ 1's in
$(x_1, x_2, \ldots, x_n)$ can be replaced by a single state $S_u$. A simplified
model is obtained under the following transition rules and presented in Fig. 3.

R1'. For $u = 0,1,\ldots,n-1$, the system will move to state $S_{u+1}$
from state $S_u$ with transition rate $(n-u)\mu$
when a new recovery point is formed.

R2'. For all $u \ge 2$, the system is able to leave state $S_u$
for state $S_{u-2}$ with rate $\binom{u}{2}\lambda = u(u-1)\lambda/2$.

R3'. For all $u \ge 1$, there is a transition from state $S_u$ to
state $S_{u-1}$ with rate $u(n-u)\lambda$.

R4'. The system can transfer directly from the entry state $S_r$
to the terminal state $S_{r+1}$ with transition rate $n\mu$.
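
As a worked example of the simplified model, the mean time from $S_r$ to
$S_{r+1}$ can be obtained by solving the usual linear equations for mean
absorption times. The sketch below is ours, not the authors' procedure: the
names solve and mean_interval are assumptions, and the rate used for R2' is the
pair count $u(u-1)\lambda/2$ implied by rule R2.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def mean_interval(n, mu, lam):
    """Mean time from S_r to S_{r+1} in the simplified model.

    Unknown u < n is the mean time to absorption from S_u;
    unknown n stands for the entry state S_r (all ones).
    """
    size = n + 1
    a = [[0.0] * size for _ in range(size)]
    b = [1.0] * size
    for u in range(size):
        out = []                                        # (target or None for S_{r+1}, rate)
        if u == n:
            out.append((None, n * mu))                  # R4'
        else:
            out.append((u + 1 if u + 1 < n else None, (n - u) * mu))  # R1'
            if u >= 1:
                out.append((u - 1, u * (n - u) * lam))  # R3'
        if u >= 2:
            out.append((u - 2, u * (u - 1) * lam / 2))  # R2'
        a[u][u] = sum(rate for _, rate in out)
        for target, rate in out:
            if target is not None:
                a[u][target] -= rate
    return solve(a, b)[n]


for n in (1, 2, 3, 4):
    print(n, round(mean_interval(n, 1.0, 1.0), 4))
```

With $\mu = \lambda = 1$ this gives mean intervals of 1, 1, 2.5 and 10.5 for
$n = 1, 2, 3, 4$, which illustrates how quickly the interval between recovery
lines grows with the number of processes under these assumed rates.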

2.3. The Analysis of Asynchronous Recovery Blocks 

With the model developed above, we can characterize the behavior of
asynchronous recovery blocks in terms of the degree of interprocess
communications and the distribution of recovery points. With the exponentially
distributed interprocess communications and recovery points, $X_r$ for all $r$
becomes stochastically identical. Let $X$ denote a random variable representing
the interval between two successive recovery lines, and $L_i$ the number of
states saved in process $P_i$ during interval $X$. The probability distribution
of $X$ and the mean value of $L_i$ are derived below.

A. The distribution of X

Let the state space $\Omega = \{0,1,2,\ldots,m\}$, where $m = 2^n$, be the set
of states of the foregoing continuous-time Markov process with the following
convention for numbering states:

(a). $S_r \to$ state 0,

(b). an intermediate state $(x_1, x_2, \ldots, x_n) \to$ state
$\left(\sum_{i=1}^{n} x_i 2^{i-1} + 1\right)$, and

(c). $S_{r+1} \to$ state $m$.

Then, the Chapman-Kolmogorov equation becomes

$$\frac{d}{dt}\pi(t) = \pi(t)\,H$$

where $H$ is the $(m+1) \times (m+1)$ transition rate matrix $[h(u,v)]$ in
which the $(u,v)$ element is the transition rate from state $u$ to state $v$
(each diagonal element being the negative of the sum of the other elements in
its row), and $\pi(t)$ is a vector whose $k$th element is the probability that
the system is in state $k$ at time $t$. The initial condition is
$\pi(0) = [1,0,0,\ldots,0]$. The interval between two successive recovery
lines, $X$, is equal to the time needed for transition from state 0 to state
$m$. Therefore, the density function of $X$, namely $f_X(t)$, is given by

$$f_X(t) = \frac{d}{dt}\pi_m(t) = \sum_{u=0}^{m-1} \pi_u(t)\,h(u,m).$$

B. The mean value of $L_i$

Since we are only concerned with the number of recovery points established by
process $P_i$ during interval $X$, a discrete Markov chain is used. To compute
the mean value of $L_i$, a new Markov chain, denoted by $Y_d$, is constructed
based on the previous model with the following two steps.

(a). Convert the previous model to a discrete model:

The new chain, $Y_d$, has the same states as the previous Markov process.
Let $G = \sum_{i=1}^{n} \mu_i + \sum_{i=1}^{n} \sum_{k=i+1}^{n} \lambda_{ik}$
be the normalization factor. The transition probability from state $u$ to
state $v$ in $Y_d$ is equal to: for $u, v = 0,1,\ldots,m$,

$$p(u,v) = \frac{h(u,v)}{G} \ \text{ if } u \neq v, \quad\text{and}\quad
p(u,u) = 1 - \sum_{v \neq u} p(u,v).$$

(b). Arrivals at a state $S_u = (x_1, x_2, \ldots, x_n)$ where $x_i = 1$ can be
grouped into two classes. One is formed as a result of the occurrences of
RP's in $P_i$ and the other is formed as a result of interprocess
communications and establishments of RP's in processes other than $P_i$.
Accordingly, the state $S_u = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ with
$x_i = 1$ can be split into two states $S_u'$ and $S_u''$ representing the two
classes, respectively. Both states have the same departure processes as that
of $S_u$. However, all arrivals at state $S_u$ due to the occurrence of
recovery points in $P_i$ enter state $S_u'$ whereas all other transitions are
made to $S_u''$. Hence the number of RP's associated with state $S_u'$ is
represented by that of arrivals at $S_u'$.

Figure 4 shows the conversion and the split of state $S_2 = (1,0,0)$ of the
Markov chain for the three concurrent processes in Figure 2. With the new
discrete model, $Y_d$, we can calculate the mean number of visits to state
$S_u'$, denoted as $N_{S_u'}$, and the mean value of $L_i$ using the following
relationship:

$$E(L_i) = \sum_{S_u' \in \Omega_d} N_{S_u'}$$

where $\Omega_d$ is the state space of $Y_d$.

Suppose process $P_i$ detects an error or fails the acceptance test at one of
its recovery points $RP_j$, where $j = 1,2,\ldots$. The rollback of $P_i$ may
propagate to $k$ processes in the process set $\{P_l \mid l \in A\}$ where
$A = \{1,2,\ldots,n\}$. Let $D_j$ be the rollback distance associated with the
$k$ processes and $RP_j$ for $j = 1,2,\ldots$. Then, $X$ represents the
supremum of these random variables, i.e., $X \ge D_j$ for all $j$. In Figure 5,
the mean values of $X$ are plotted as a function of $n$. It shows that $X$
increases drastically when there is an increase in the number of processes
involved in the rollback recovery. The density function of $X$, $f_X(t)$, is
plotted in Figure 6. For all the three cases in Fig. 6, there is a sharp pulse
near $t = 0$, which is due to direct transitions between $S_r$ and $S_{r+1}$
and a longer transition time needed once the system enters intermediate states.

Let $\rho = \sum_{i=1}^{n} \sum_{k=i+1}^{n} \lambda_{ik} \Big/ \sum_{i=1}^{n} \mu_i$,
which represents the relative ratio between the density of interprocess
communications and recovery point establishments. With a fixed value of $\rho$
and varying values of $\mu$'s and $\lambda$'s for three processes, we have
performed computer simulation and the results are tabulated in Table 1. The
minima of $X$ and $L_i$ occur when the distribution of recovery points among
these processes is uniformly balanced (i.e., $\mu_1 = \mu_2 = \mu_3$). The
distribution of interprocess communications does play an important role in
determining the probability of rollback propagation but has little effect on
$X$ and $L_i$ once the set of processes involved in rollback recovery is
determined.

3. SYNCHRONIZED RECOVERY BLOCKS 

The simplest way of avoiding unbounded rollback propagations is to synchronize
the establishment of recovery points during process execution. In this method,
interactions are inhibited between any pair of processes during their
establishment of recovery points. There are three conceivable strategies in
deciding when a synchronization request is to be issued: (1) at a constant
interval; (2) when the time elapsed since the previous recovery line exceeds a
specified value; or (3) when the number of states saved after the previous
recovery line is larger than a prespecified number. The implementation of the
first strategy is simple since the synchronization request is issued without
any knowledge of the state of execution. Nevertheless, this strategy may become
very inefficient since it is possible to make synchronization requests
immediately after the formation of recovery lines. For the second and third
strategies, rollback distance and the number of saved states are prevented from
becoming too large. However, in this case each process must be aware of the
occurrence of a recovery line whenever it is established.
Upon receipt of a synchronization request, every process has to prepare for
establishing a recovery line and also has to wait for the commitment (for
establishing a recovery line) from other processes before it executes an
acceptance test. Thus, all cooperating processes perform their acceptance tests
at the same instant upon receiving the commitments from all other processes.
Let P_ij-ready, for j = 1,2,...,n, be the flags in process $P_i$ to indicate
commitments from $P_j$. The steps for synchronization in each process $P_i$ are
described as follows:

1. execute "its own normal process" until "acceptance test";

2. set P_ii-ready := ON and then broadcast P_ii-ready;

3. while not (all P_ij-ready = ON) do
       receive messages;
       if a message is P_jj-ready then set P_ij-ready := ON
       else record the message

4. do "acceptance test" and record process states.
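
The four steps can be traced with a small deterministic simulation. This is a
sketch under simplifying assumptions that are ours, not the paper's: broadcasts
are delivered instantly, other messages are ignored, and the function name
synchronize and the list reach_times (the instant at which each process reaches
its acceptance test in step 1) are hypothetical.

```python
def synchronize(reach_times):
    """Return (test_time, waits) for processes following steps 1 to 4.

    reach_times[i] is when process i arrives at its acceptance test.
    ready[i][j] plays the role of the flag P_ij-ready.
    """
    n = len(reach_times)
    ready = [[False] * n for _ in range(n)]
    test_time = [None] * n
    # Handle the broadcasts in the order in which processes finish step 1.
    for t, j in sorted((t, j) for j, t in enumerate(reach_times)):
        for i in range(n):
            ready[i][j] = True           # step 2 for i == j, step 3 for receivers
        for i in range(n):
            if test_time[i] is None and all(ready[i]):
                test_time[i] = t         # step 4: acceptance test and state saving
    waits = [test_time[i] - reach_times[i] for i in range(n)]
    return test_time, waits


test_time, waits = synchronize([2.0, 5.0, 3.0])
print(test_time)  # [5.0, 5.0, 5.0]
print(waits)      # [3.0, 0.0, 2.0]
```

Every process performs its acceptance test when the slowest process commits, so
the waiting times are the computation power lost to synchronization.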

Establishment of recovery lines upon synchronization requests is shown in Figure
7. Synchronization causes the computation power to be diminished because
processes have to wait for the commitment (as in step 3). Let $y_i$ be the
interval between the receiving of a synchronization request and the moment that
process $P_i$ reaches its next acceptance test (in step 1). Then, according to
the assumptions in Section 2.1, $y_i$ is an exponentially distributed random
variable with parameter $\mu_i$. Let $Z = \max\{y_1, y_2, \ldots, y_n\}$. The
total loss in computation power is $CL = \sum_{i=1}^{n} (Z - y_i)$. The mean
loss becomes

$$\overline{CL} = n \int_0^\infty \bigl(1 - F_Z(t)\bigr)\,dt - \sum_{i=1}^{n} \frac{1}{\mu_i}$$

where $F_Z(t)$ is the distribution function of $Z$, and equals

$$F_Z(t) = \prod_{i=1}^{n} \bigl(1 - e^{-\mu_i t}\bigr).$$
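
The mean loss can be evaluated in closed form by expanding the product in
$F_Z(t)$ and integrating term by term (inclusion-exclusion over subsets of
processes). The Python sketch below does this; the function names
mean_max_exponential and mean_computation_loss are our own, and the approach is
one convenient way to evaluate the integral rather than the authors' stated
method.

```python
from itertools import combinations


def mean_max_exponential(mu):
    """E[Z] for Z = max of independent exponentials with rates mu.

    Integrating 1 - prod(1 - exp(-mu_i t)) gives an alternating sum
    of 1 / (sum of rates) over all non-empty subsets of the rates.
    """
    total = 0.0
    for k in range(1, len(mu) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(mu, k):
            total += sign / sum(subset)
    return total


def mean_computation_loss(mu):
    """Mean loss: n * E[Z] minus the sum of the individual means 1/mu_i."""
    return len(mu) * mean_max_exponential(mu) - sum(1.0 / m for m in mu)


print(mean_computation_loss([1.0, 1.0, 1.0]))
```

For three processes with $\mu_i = 1$, $E[Z] = 1 + 1/2 + 1/3$ and the mean loss
is 2.5 time units per synchronization. The subset enumeration grows as $2^n$,
so this form is only practical for a modest number of processes.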
4. IMPLANTATION OF PSEUDO RECOVERY POINTS

In the construction of a recovery block, usually, an acceptance test is a number
of executable assessments provided by the programmer and then followed by a
state saving. Note that process states can also be recorded upon any other
requests if they are considered useful in the rollback recovery. A pseudo
recovery point (PRP) is defined as a recovery point that is established without
a preceding acceptance test and is proposed here as an alternative for avoiding
the domino effect in a set of cooperating concurrent processes. With a monitor
as the interprocess communication means, Kim [15] and Kant and Silberschatz
[16] discussed methods for implanting recovery points in a centralized manner.
Similarly, we consider a method for implanting PRP's in the set of cooperating
concurrent processes in a decentralized manner.

To make every recovery point $RP_j^i$ in process $P_i$ maximally useful for
rollback error recovery, there should be corresponding recovery points in the
other processes that have to roll back as a result of the rollback propagation
from $P_i$. If such recovery points do not actually exist, a pseudo recovery
point, $PRP_j^{ii'}$, has to be inserted in process $P_{i'}$ for a given
$RP_j^i$ in process $P_i$. Further, in order to avoid the need of tracing
recovery points at that particular moment, a PRP is established in each of the
other processes involved for $RP_j^i$. An algorithm for implanting PRP's is
given below.

(1). When $P_i$ establishes a recovery point $RP_j^i$, it broadcasts a PRP
implantation request to other processes.

(2). If $P_{i'}$ receives the implantation request, it records its state as
$PRP_j^{ii'}$ upon the completion of the current instruction without an
acceptance test. Then $P_{i'}$ broadcasts the commitment $C_{i'}$.

(3). Every process executes its own normal task after it establishes
$RP_j^i$ or $PRP_j^{ii'}$. However, the messages sent to a process by $P_{i'}$
prior to $C_{i'}$ have to be retained in the state saved.
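
Steps (1) and (2) of the implantation algorithm can be sketched as follows. The
sketch is illustrative only: the Process class, the function establish_rp, and
the dictionary key (owner pid, j) used to name a PRP are assumptions, message
delivery is treated as instantaneous, and the message retention of step (3) is
not modelled.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Process:
    pid: int
    state: dict
    rps: list = field(default_factory=list)    # states saved after acceptance tests
    prps: dict = field(default_factory=dict)   # (owner pid, j) -> state saved as a PRP


def establish_rp(owner, processes):
    """Owner saves its j-th RP and every other process implants a PRP."""
    owner.rps.append(deepcopy(owner.state))                  # step (1): RP_j in the owner
    j = len(owner.rps)
    for p in processes:                                      # broadcast of the request
        if p is not owner:
            p.prps[(owner.pid, j)] = deepcopy(p.state)       # step (2): no acceptance test
    return j


procs = [Process(1, {"x": 1}), Process(2, {"x": 10}), Process(3, {"x": 100})]
j = establish_rp(procs[0], procs)
procs[1].state["x"] = 11                                     # later work does not alter the PRP
print(j, procs[1].prps[(1, j)], procs[2].prps[(1, j)])
```

One recovery point in the first process leads to three saved states in total:
the RP itself and one PRP in each of the other two processes.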

Assume that process $P_i$ detects an error before establishing $RP_{j+1}^i$ and
that this error is local to $P_i$. The recovery line (called a pseudo recovery
line, $PRL_j^i$) formed by $RP_j^i$ and all $PRP_j^{ii'}$'s is able to recover
these processes even if the error has already propagated to other processes.
However, when the error detected in $P_i$ is due to error propagation from
another process (and therefore not local to $P_i$), the contents of
$PRP_j^{ii'}$ may have already been contaminated if this error occurred prior
to establishing $PRP_j^{ii'}$. The restart from the pseudo recovery line formed
by both $RP_j^i$ and all $PRP_j^{ii'}$'s may just reproduce the same error.
Therefore, rollback propagation may continue until every process involved has
rolled back to a pseudo recovery line past at least one of its recovery points.
Most of the processes involved are assured to reach the pseudo recovery line by
rolling back past only one recovery point. A few processes may have to roll
back past more than one RP due to random interprocess interactions, and this
can not be avoided unless a forced synchronization is employed as discussed in
Section 3. Consequently, the pseudo recovery line allows the processes to have
the shortest rollback distance for backward error recovery without
synchronization. Note that the pseudo recovery line is now guaranteed to
contain correct states of all concerned processes. An algorithm of rollback
recovery with these pseudo recovery points is given by:

(1). If an error is found in process $P_i$, set $p := i$ where $p$ is a
rollback pointer.

(2). $P_p$ rolls back to its previous recovery point $RP_j^p$. All processes
$P_{i'}$ affected by the rollback of $P_p$ roll back to their respective
pseudo recovery points $PRP_j^{pi'}$.

(3). For every affected process $P_{i'}$, if the rollback has not passed its
most recent recovery point, then set $p := i'$ and go back to step 2.
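
The rollback procedure can be traced on a timeline of recovery point times. The
sketch below rests on assumptions of our own: the set of affected processes is
given and fixed, a PRP exists in every process at each recovery point time of
every other process, each process has a recovery point no later than the error
time, and the names pseudo_rollback, rp_times and affected are hypothetical.

```python
def pseudo_rollback(failed, error_time, rp_times, affected):
    """Return (restart_time, trace) for rollback with pseudo recovery points.

    rp_times[p] lists the recovery point times of process p; affected is
    the set of processes hit by rollback propagation from the failed one.
    """
    latest = {
        q: max(t for t in rp_times[q] if t <= error_time)
        for q in affected | {failed}
    }
    p = failed                                   # step (1): rollback pointer
    restart = latest[p]                          # step (2): P_p returns to its RP
    trace = [(p, restart)]
    while True:
        # step (3): processes whose own most recent RP has not been passed
        behind = [q for q in sorted(affected) if latest[q] < restart]
        if not behind:
            return restart, trace
        p = behind[0]
        restart = latest[p]                      # step (2) again with the new pointer
        trace.append((p, restart))


rp_times = {1: [0.0, 4.0, 9.0], 2: [0.0, 6.0], 3: [0.0, 3.0, 8.0]}
restart, trace = pseudo_rollback(1, 10.0, rp_times, {2, 3})
print(restart, trace)  # 6.0 [(1, 9.0), (2, 6.0)]
```

Here the error in process 1 first selects the pseudo recovery line at time 9.0;
process 2 has not yet passed its own recovery point at 6.0, so the line moves
back to 6.0, at which point every process has rolled back past at least one of
its own recovery points.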
In Figure 8, the establishment of PRP's in processes $P_1$, $P_2$, and $P_3$ is
illustrated. When $P_3$ fails its acceptance test, all processes have to
restart from the pseudo recovery line formed by the previous recovery point of
$P_3$ and the corresponding pseudo recovery points in $P_1$ and $P_2$, if $P_1$
and $P_2$ are affected by the rollback of $P_3$.

In the above algorithm, we can find that every process needs to preserve a
recovery point for restart in case it fails. Also $(n-1)$ pseudo recovery
points are needed for a process to form a pseudo recovery line with other
processes, where $n$ is the total number of concurrent processes. It is
therefore required to save $n$ states for every RP, i.e. one RP and $(n-1)$
PRP's, and all old RP's and PRP's except those in the pseudo recovery lines
$\{PRL_j^i \mid i = 1,\ldots,n$, and $RP_j^i$ is the most recent RP in $P_i\}$
can be purged when a new recovery point is established, thereby reducing
storage requirements for saving RP's and PRP's. Note that rollback distance is
bounded by the supremum of $\{y_1, y_2, \ldots, y_n\}$ where $y_i$ is the
interval between two successive recovery points of process $P_i$. The
additional time overhead for every recovery point is $(n-1)t_r$, where $t_r$ is
the time needed to record the process state. These overheads should be assessed
against the gain of process autonomy and avoidance of unbounded rollback
propagations.

5. CONCLUSION 

We have quantitatively evaluated three different recovery blocks employed in 
backward error recovery for concurrent processing. The recovery block dealt with in 
this paper is defined in software and comprises an acceptance test and a state 
saving. The environment of concurrent processing considered here is not restricted 
to any particular method of interprocess communications or system structure. 

We have estimated the overhead required to avoid the domino effect when
recovery or pseudo recovery points are employed. For both the synchronization
method and the implantation of pseudo recovery points, the overheads are largely
related to the construction of synchronization, RP's and PRP's. They would
become an unacceptable burden when synchronizations and pseudo recovery points
are constructed frequently but interprocess communications rarely occur. At the
other extreme, i.e. asynchronous recovery blocks, it may result in a longer
rollback distance due to unlimited rollback propagations (in place of
synchronization and PRP insertion overheads).

In this paper, we have considered the distribution of the interval between two
successive recovery lines instead of the actual rollback distance. The rollback
distance after an error is detected is related to the probability of error
occurrence, error detection, and rollback propagation, etc. However, the
interval $X$ does represent an upper bound for the real rollback distance.

To select a suitable strategy or a combination of these three methods, we have
to first examine the properties of concurrent processes such as the amount of
interprocess communications and the distribution of recovery points. Then, we
weigh the trade-off between the loss of computation power during normal
operation and the increase in response time due to rollback recovery. For
instance, the asynchronous method or a longer synchronization period is not
acceptable for time-critical tasks in which a delay in system response beyond a
certain value, the system deadline, leads to a catastrophic failure. The
implantation of pseudo recovery points is also inefficient for concurrent
processes when they establish recovery points frequently (thus requiring many
PRP's to be implanted) and rarely communicate with each other. In general, if
more knowledge of the execution state in concurrent processes can be obtained,
a better strategy for implementing recovery blocks can be derived.

Figure 3. The Simplified Model of Asynchronous RB's for n Processes

Figure 4. The Construction of States $S_2'$ and $S_2''$ of the Discrete Markov Chain

Figure 5. Mean Value of X vs. the Number of Processes

Figure 6. The Density Function of X, $f_X(t)$

Figure 7. Establishment of Recovery Lines upon Synchronization Requests

Figure 8. Establishment of Pseudo Recovery Points for Rollback Error Recovery
(Legend: recovery points (RP) and pseudo recovery points (PRP) are marked on
each process; all occurrences of interactions are omitted.)